feat: add direct remote access over s3 and https via warcio >= 1.8.0 #25

Draft
handecelikkanat wants to merge 6 commits into main from feat/s3-access-via-warcio1.8

@handecelikkanat
Contributor

@handecelikkanat handecelikkanat commented Apr 9, 2026

From https://github.com/commoncrawl/issues/issues/684

This PR adds direct remote access (s3, https) to warc/wet/wat files in S3 buckets, using warcio.

Since 1.8.0, warcio supports direct remote file access over s3 and https: https://github.com/webrecorder/warcio/blob/master/CHANGELIST.rst

This PR adds:

  • fsspec.open call to replace local open call in warcio-iterator.py
  • New make targets:
    • make iterate-remote to access the example whirlwind.warc.gz file in the GitHub repo directly over https
    • make cdxj-remote-https and make cdxj-remote-s3 to index two EoT WARCs over https and s3
    • make extract-remote-https and make extract-remote-s3 to extract records from the two EoT WARCs over https and s3
    • Note: The tutorial still starts with processing local files, as a gentle introduction.
  • New requirement warcio[s3]>=1.8.0
  • New CI steps: run: make iterate-remote, run: make cdxj-remote-https, run: make extract-remote-https (the s3 variants are not tested in CI because they require AWS credentials)
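
A minimal sketch of the first item, swapping a local open for fsspec.open in front of warcio's ArchiveIterator. This is not the actual code in warcio-iterator.py: the function name iter_warc_records is hypothetical, and the sketch assumes that fsspec has the matching backend installed (aiohttp for https, s3fs for s3).

```python
from typing import Iterator, Optional, Tuple


def iter_warc_records(path: str) -> Iterator[Tuple[str, Optional[str]]]:
    """Yield (record type, target URI) for each record in a local or remote WARC.

    Hypothetical helper for illustration; not taken from the PR.
    """
    # Third-party imports are kept inside the function so the module can be
    # imported even where the optional dependencies are not installed.
    import fsspec
    from warcio.archiveiterator import ArchiveIterator

    # fsspec chooses the backend from the URL scheme (plain path, https://, s3://).
    # The stream is opened as raw bytes; ArchiveIterator deals with the gzip layer.
    with fsspec.open(path, "rb") as stream:
        for record in ArchiveIterator(stream):
            yield record.rec_type, record.rec_headers.get_header("WARC-Target-URI")
```

Calling it on the local whirlwind.warc.gz or on an https:// or s3:// URL of the same file should then look identical from the caller's side, for example `for rec_type, uri in iter_warc_records("whirlwind.warc.gz"): print(rec_type, uri)`.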

@malteos

malteos commented Apr 10, 2026

> fsspec_open call from warcio.utils

This seems unnecessary. You could open the remote files directly via fsspec. No need to use the warcio utils.

To illustrate the S3 support of warcio, you could call the warcio CLI directly without the custom python script.

@handecelikkanat
Contributor Author

handecelikkanat commented Apr 10, 2026

> fsspec_open call from warcio.utils
>
> This seems unnecessary. You could open the remote files directly via fsspec. No need to use the warcio utils.

Previously this used a local file open; I'll check fsspec.

> To illustrate the S3 support of warcio, you could call the warcio CLI directly without the custom python script.

I was thinking that warcio extract should work with remote files as well. I'll modify that task: cdx index extract info -> warcio extract over (local and) remote files.

Any other suggestions? To me, warcio index is easily confused with the cdx index because of the "index" label.
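
A rough sketch of how the warcio extract step could be driven from Python without a custom script. It only builds the command line; whether warcio extract accepts https:// or s3:// paths depends on the installed warcio version (assumed here to be >= 1.8.0 with the s3 extra). The helper name warcio_extract_cmd is hypothetical, and the path and offset in the usage note are placeholders, not files from this PR.

```python
from typing import List


def warcio_extract_cmd(path: str, offset: int, payload_only: bool = False) -> List[str]:
    """Build the argv for `warcio extract` on a local or remote WARC.

    Hypothetical helper for illustration; remote paths are assumed to be
    accepted by the CLI when warcio >= 1.8.0 is installed.
    """
    if offset < 0:
        raise ValueError("offset must be non-negative")
    cmd = ["warcio", "extract"]
    if payload_only:
        # --payload prints only the record payload instead of the full record.
        cmd.append("--payload")
    # Positional arguments: the WARC file, then the byte offset of the record.
    cmd += [path, str(offset)]
    return cmd
```

The resulting list can be passed to `subprocess.run(cmd, check=True)`, for example with `warcio_extract_cmd("s3://example-bucket/example.warc.gz", 1024)`, where the bucket, key, and offset are made up.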

@handecelikkanat
Contributor Author

handecelikkanat commented Apr 10, 2026

@malteos Can cdxj-indexer work with remote files (maybe through warcio) now?

@handecelikkanat
Contributor Author

handecelikkanat commented Apr 10, 2026

> @malteos Can cdxj-indexer work with remote files (maybe through warcio) now?

I guess this is not guaranteed. Their setup.py depends on warcio but not on the s3 extra, and it does not require warcio >= 1.8.0: https://github.com/webrecorder/cdxj-indexer/blob/9ad2b9e1c54d2d20c391050fdb831ca1ee981504/setup.py#L49

I'll continue assuming it needs to work on local files.

EDIT: Greg explained that this can be handled by making the requirement stricter on the whirlwind side ✔️

@handecelikkanat handecelikkanat force-pushed the feat/s3-access-via-warcio1.8 branch from ac6d444 to 7d99aee Compare April 12, 2026 16:29
@handecelikkanat handecelikkanat force-pushed the feat/s3-access-via-warcio1.8 branch from bd16c58 to 157eeec Compare April 13, 2026 09:27